ABSTRACT



We study the problem of computing shortest path or distance be- 
tween two query vertices in a graph, which has numerous impor- 
tant applications. Quite a number of indexes have been proposed 
to answer such distance queries. However, all of these indexes can 
only process graphs of size barely up to 1 million vertices, which is 
rather small in view of many of the fast-growing real-world graphs 
today such as social networks and Web graphs. We propose an 
efficient index, which is a novel labeling scheme based on the
independent set of a graph. We show that our method can handle graphs
three orders of magnitude larger than those handled by existing indexes.

1. INTRODUCTION 

Computing the shortest path or distance between two vertices is
a basic operation in processing graph data. The operation is
important not only because it is a key building block in many
algorithms but also because it has numerous applications of its own.
In addition to applications in transportation, VLSI design, urban
planning, operations research, robotics, etc., the proliferation of
network data in recent years has introduced a broad range of new
applications. Examples include social network analysis, page
similarity measurement in Web graphs, entity relationship ranking in
semantic Web ontology, routing in telecommunication networks, and
context-aware search in social networking sites, to name but a few.

In many of these new applications, however, the size of the
underlying graph is often on the scale of millions to billions of
vertices and edges. Such large graphs are becoming more and more
common; some of the well-known ones include Web graphs, various
social networks (e.g., Twitter, Facebook, LinkedIn), RDF graphs,
mobile phone networks, SMS networks, etc. Computing shortest
path or distance in these large graphs with conventional algorithms
such as Dijkstra's algorithm or simple BFS may result in an
unacceptably long running time.

For computing shortest path or distance between two points in
a road network, many efficient indexes have been proposed
[1, 3, 8, 14, 26, 27, 28]. However, these works apply unique
properties of road networks and hence are not applicable for other 
graphs/networks that are not similar to road networks. In recent 
years, a number of indexes have been proposed to process distance 
queries in general sparse graphs [10, 12, 13, 17, 30, 32, 33].
However, as we will discuss in detail in Section 3, these indexes can
only handle relatively small graphs due to high index construction
cost and large index storage space. As a reference, the largest real
graphs tested in these works have only 581K vertices with average
degree 2.45 [10], and 694K vertices with average degree 0.45 [17],
while most of the other real graphs tested are significantly smaller. 

We propose a new index for computing shortest path or distance
between two query vertices; our method can handle graphs with
hundreds of millions of vertices and edges. Our index, named
IS-LABEL, is based on a novel application of the independent
set of a graph, which allows us to organize the graph into
layers that form a hierarchical structure. The hierarchy can be used 
to guide the shortest path computation and hence leads to the design 
of effective vertex labels (i.e., the index) for distance computation. 

We highlight the main contributions of our paper as follows. 

• We propose an efficient index for answering shortest path or
distance queries, which can handle graphs up to three orders
of magnitude larger than those tested in the existing works
[10, 12, 13, 17, 30, 32, 33]. None of these existing works
can handle even the medium-sized graphs that we tested.

• We design an effective labeling scheme such that the label 
size remains small even if no optimization (mostly NP-hard) 
is applied as in the existing labeling schemes. 

• Our index naturally lends itself to the design of simple and 
efficient algorithms for both index construction and query 
processing. 

• We develop I/O-efficient algorithms to construct the vertex 
labels in large graphs that may not fit in main memory. 

• We verify both the efficiency and scalability of our method 
for processing distance queries in large real-world graphs. 

Organization. Section 2 defines the problem and basic notations.
Section 3 discusses the limitations of existing works. Sections 4
and 5 present the details of index design, and Section 6 describes
the algorithms. Section 7 reports the experimental results. Section
8 discusses various issues such as handling path queries, directed
graphs, and update maintenance. Section 9 concludes the paper.



Table 1: Frequently-used notations

Notation                        Description
$G = (V_G, E_G, \omega_G)$      A weighted, undirected simple graph
$|G| = (|V_G| + |E_G|)$         The size of $G$
$\omega_G(u, v)$                The weight of an edge $(u, v)$ in $G$
$adj_G(v)$                      The set of adjacent vertices of $v$ in $G$
$SP_G(u, v)$                    A shortest path from $u$ to $v$ in $G$
$dist_G(u, v)$                  The distance from $u$ to $v$ in $G$

2. NOTATIONS 

We focus our discussion on weighted, undirected simple graphs.
Let $G = (V_G, E_G, \omega_G)$ be such a graph, where $V_G$ is the set of
vertices, $E_G$ is the set of edges, and $\omega_G : E_G \to \mathbb{N}^+$ is a function
that assigns to each edge a positive integer as its weight. We denote
the weight of an edge $(u, v)$ by $\omega_G(u, v)$. The size of $G$ is defined
as $|G| = (|V_G| + |E_G|)$.

We define the set of adjacent vertices (or neighbors) of a vertex
$v$ in $G$ as $adj_G(v) = \{u : (u, v) \in E_G\}$, and the degree of $v$ in $G$
as $deg_G(v) = |adj_G(v)|$.

We assume that a graph is stored in its adjacency list representa- 
tion (whether in memory or on disk), where each vertex is assigned 
a unique vertex ID and vertices are ordered in ascending order of 
their vertex IDs. 

Given a path $p$ in $G$, the length of $p$ is defined as
$len(p) = \sum_{e \in p} \omega_G(e)$, i.e., the sum of the weights of the edges on $p$. Given
two vertices $u, v \in V_G$, the shortest path from $u$ to $v$, denoted by
$SP_G(u, v)$, is a path in $G$ that has the minimum length among all
paths from $u$ to $v$ in $G$. We define the distance from $u$ to $v$ in $G$
as $dist_G(u, v) = len(SP_G(u, v))$. We define $dist_G(v, v) = 0$ for
any $v \in V_G$.

Problem definition: we study the following problem: given a
graph $G = (V_G, E_G, \omega_G)$, construct a disk-based index for
processing point-to-point (P2P) shortest path or distance queries, i.e.,
given any pair of vertices $(s, t) \in (V_G \times V_G)$, find $dist_G(s, t)$.
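
For reference, the conventional baseline that an index is meant to replace can be sketched as follows. This is a minimal Python sketch of Dijkstra's algorithm answering one P2P distance query on an in-memory adjacency map; the function names, the dictionary representation, and the toy edge list are illustrative assumptions and are not part of the index described in this paper.

```python
import heapq

def make_graph(edges):
    """Build an undirected adjacency map {u: {v: weight}} from (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def dijkstra_distance(adj, s, t):
    """Return dist_G(s, t), or infinity if s and t are not connected."""
    best = {s: 0}
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == t:
            return d
        if d > best[u]:
            continue  # stale heap entry, a shorter route to u was already settled
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < best.get(v, float("inf")):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

# Hypothetical toy graph: the route s-x-t costs 2 + 2, the direct edge s-t costs 5,
# and the component {y, z} is disconnected from s.
toy = make_graph([("s", "x", 2), ("x", "t", 2), ("s", "t", 5), ("y", "z", 1)])
```

Every query restarts the search from $s$, which is the cost that a precomputed index avoids.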

We focus on sparse graphs, since most large and many fast-growing
real-world networks are sparse. We will focus our discussion
on processing P2P distance queries. Computing the actual path is
a fairly simple extension with some extra bookkeeping, which
will be discussed in Section 8, where we will also show that our
index can be extended to handle directed graphs.

Table 1 gives the frequently-used notations in the paper.

3. LIMITATIONS OF EXISTING WORK 

We highlight the challenges of computing P2P distance by dis- 
cussing existing approaches and their limitations. 

3.1 Indexing Approaches 

Cohen et al. [13] proposed the 2-hop labeling that computes
for each vertex $v$ two sets, $L_{in}(v)$ and $L_{out}(v)$, where
for each vertex $u \in L_{in}(v)$ and $w \in L_{out}(v)$, there is a
path from $u$ to $v$ and from $v$ to $w$. The distances $dist_G(u, v)$
and $dist_G(v, w)$ are pre-computed. Given a distance query, $s$
and $t$, the index ensures that $dist_G(s, t)$ can be answered as
$\min_{v \in (L_{out}(s) \cap L_{in}(t))} \{dist_G(s, v) + dist_G(v, t)\}$. However,
computing the 2-hop labeling, including the heuristic algorithms
[12, 30], is very costly for large graphs. Moreover, the size of the
2-hop labels is too big to be practical for large graphs.

Xiao et al. [33] exploit symmetric structures in an unweighted
undirected graph to compress BFS trees to answer distance queries.
However, the overall size of all the compressed BFS trees is
prohibitively large even for medium sized graphs.

Wei [32] proposed an index based on a tree decomposition of
an undirected graph G, where each node in the tree stores a set of 
vertices in G. The distance between each pair of vertices stored in 
each tree node is pre-computed, so that queries can be answered 
by considering the minimum distance between vertices stored in a 
simple path in the tree. However, the pair-wise distance compu- 
tation for vertices stored in the tree nodes, especially in the root 
node, is expensive and requires huge storage space. As a result, the 
method cannot scale to handle large graphs. 

Recently Chang et al. [10] also applied tree decomposition to
compute multi-hop labels that trade query efficiency of 2-hop labels
[13] for indexing cost. Similar to [32], tree decomposition is an
expensive operation and the graphs that can be handled by their 
method are still relatively small. 

Jin et al. [17] proposed to use a spanning tree as a highway
structure in a directed graph, so that distance from s to t is computed
as the length of the shortest path from s to some vertex u, then 
from u via the highway (i.e., a path in the spanning tree) to some 
vertex v, and finally from v to t. Every vertex is given a label so 
that a set of entry points in the highway (e.g., u) and a set of exit 
points (e.g., v) can be obtained. However, the labeling is too costly, 
in terms of both time and space, for the method to be practical for 
even medium sized graphs (e.g., one step in the process requires all 
pairs shortest paths to be computed and input to another step). 

The problem of P2P distance querying has been well studied for
road networks. Abraham et al. [2] recently proposed a hub-based
labeling algorithm, which is the fastest known algorithm in the road
network setting. This method incorporates heuristic steps in
distance labeling by making use of the concepts of contraction
hierarchies [14] and shortest path covers [13]. There are other fast
algorithms such as [27], [14], and [8], that are also based on the concept
of a hierarchy of highways to reduce the search space for computing
shortest paths. However, it has been shown in [3] and [1] that
the effectiveness of these methods relies on properties such as low
VC dimensions and low highway dimensions, which are typical in
road networks but may not hold for other types of graphs. Another
approach is based on a concise representation of all pairs shortest
paths [26, 28]. However, this approach heavily depends on the
spatial coherence of vertices and their inter-connectivity. Therefore,
while P2P distance querying has been quite successfully resolved
for road networks, these methods are in general not applicable to
graphs from other sources.

Cheng et al. [11] proposed an index for computing the distance
from a source vertex to all other vertices, which can be used to com- 
pute P2P distance, but much computation will be wasted in com- 
puting the distances from the source to many irrelevant vertices. 

3.2 Other Approaches 

When the input graph is too large to fit in main memory,
external memory algorithms can be used to reduce the high disk I/O
cost. Existing external memory algorithms are mainly for computing
single-source shortest paths [18, 20, 21, 22, 23] or BFS
[9, 19, 24], which are wasteful for computing P2P distance. In
addition, external memory algorithms are very expensive in practice.

There are also a number of approximation methods [7, 15, 25,
29, 31] proposed to compute P2P distance. Although these methods
have a lower complexity than the exact methods in general, they
are still quite costly for processing large graphs, in terms of both
preprocessing time and storage space. We focus on exact distance
querying but remark that approximation can be applied on top of
our method (e.g., on the graph $G_k$ defined in Section 5).



4. QUERYING DISTANCE BY VERTEX 
HIERARCHY 

In this section, we present our main indexing scheme, which con- 
sists of the following components: 

• A layered structure of vertex hierarchy constructed from the 
input graph. 

• A vertex labeling scheme developed from the vertex hierar- 
chy. 

• Query processing using the set of vertex labels. 

We discuss each of these three components in Sections 4.1 to 4.3.

4.1 Construction of Vertex Hierarchy 

The main idea of our index is to assign hierarchy to vertices in an 
input graph G so that we can use the vertex hierarchy to compute 
the vertex labels, which are then used for querying distance. 

To create hierarchies for vertices in G, we construct a layered 
hierarchical structure from G. To formally define the hierarchical 
structure, we first need to define the following two important prop- 
erties that are crucial in the design of our index: 

• Vertex independence: given a graph $H = (V_H, E_H, \omega_H)$
and a set of vertices $I$, we say that $I$ maintains the vertex
independence property with respect to $H$ if $I \subseteq V_H$ and
$\forall u, v \in I$, $(u, v) \notin E_H$, i.e., $I$ is an independent set of $H$.

• Distance preservation: given two graphs
$H_1 = (V_{H_1}, E_{H_1}, \omega_{H_1})$ and $H_2 = (V_{H_2}, E_{H_2}, \omega_{H_2})$, we say that
$H_2$ maintains the distance preservation property with respect
to $H_1$ if $\forall u, v \in V_{H_2}$, $dist_{H_2}(u, v) = dist_{H_1}(u, v)$.

While distance preservation is essential for processing distance 
queries, vertex independence is critical for efficient index construc- 
tion as we will see later when we introduce the index. 
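
The two properties can be checked mechanically on small graphs. The following Python sketch assumes graphs are stored as adjacency maps `{u: {v: weight}}` and that $V_{H_2} \subseteq V_{H_1}$; the helper names and the three-vertex example are hypothetical and serve only to illustrate the definitions.

```python
import heapq
from itertools import combinations

def is_independent_set(adj, vertices):
    """True if `vertices` is a subset of V_H and no edge of H joins two of them."""
    vs = set(vertices)
    return vs <= set(adj) and all(v not in adj[u] for u, v in combinations(vs, 2))

def distances_from(adj, src):
    """Single-source shortest distances in the weighted graph adj = {u: {v: weight}}."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def preserves_distances(h1, h2):
    """True if dist_{H2}(u, v) = dist_{H1}(u, v) for all u, v in V_{H2}."""
    inf = float("inf")
    for u in h2:
        d1, d2 = distances_from(h1, u), distances_from(h2, u)
        if any(d1.get(v, inf) != d2.get(v, inf) for v in h2):
            return False
    return True

# Hypothetical example: H1 is the path u - v - w with unit weights.
H1 = {"u": {"v": 1}, "v": {"u": 1, "w": 1}, "w": {"v": 1}}
H2_good = {"u": {"w": 2}, "w": {"u": 2}}  # v removed, edge (u, w) of weight 2 added
H2_bad = {"u": {}, "w": {}}               # v removed, nothing added
```

Here `{v}` is an independent set of `H1`, and only `H2_good` preserves the distance between `u` and `w`.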

We now formally define the layered hierarchical structure, fol- 
lowed by an illustrating example. 

Definition 1 (Vertex Hierarchy). Given a graph
$G = (V_G, E_G, \omega_G)$, a vertex hierarchy structure of $G$ is defined by a
pair $(\mathbb{L}, \mathbb{G})$, where $\mathbb{L} = \{L_1, \ldots, L_h\}$ is a set of vertex sets and
$\mathbb{G} = \{G_1, \ldots, G_h\}$ is a set of graphs such that:

• $V_G = L_1 \cup \ldots \cup L_h$, and $L_i \cap L_j = \emptyset$ for $1 \le i < j \le h$;

• For $1 \le i \le h$, each $L_i$ maintains the vertex independence
property with respect to $G_i$, i.e., $L_i$ is an independent set of
$G_i$;

• $G_1 = G$, and for $2 \le i \le h$, let $G_i = (V_{G_i}, E_{G_i}, \omega_{G_i})$,
then $V_{G_i} = (V_G - L_1 - \ldots - L_{i-1})$, whereas $E_{G_i}$ and $\omega_{G_i}$
satisfy the condition that $G_i$ maintains the distance
preservation property with respect to $G_{i-1}$.

Intuitively, $\mathbb{L}$ is a partition of the vertex set $V_G$ and represents
a vertex hierarchy, where $L_i$ is at a lower hierarchical level than
$L_j$ for $i < j$. Meanwhile, each $G_i \in \mathbb{G}$ preserves the distance
information in the original graph $G$, as shown by the following
lemma.

LEMMA 1. For all $u, v \in V_{G_i}$, where $1 \le i \le h$,
$dist_{G_i}(u, v) = dist_G(u, v)$.

PROOF. For any $u, v \in V_{G_i}$, we have $u, v \in V_{G_j}$ for $1 \le j < i$.
Thus, we have $dist_{G_i}(u, v) = dist_{G_{i-1}}(u, v) = \ldots =
dist_{G_1}(u, v) = dist_G(u, v)$, since each $G_i$ maintains the distance
preservation property with respect to $G_{i-1}$ for $2 \le i \le h$. □

We use the following example to illustrate the concept of vertex 
hierarchy. 

EXAMPLE 1. Figure 1 shows a given graph $G$ and the vertex
hierarchy of $G$. We assume that each edge in $G$ has unit weight
except for $(e, f)$, which has a weight of 3. It is obvious that the set
$\{c, f, i\}$ forms an independent set in $G$, similarly $\{b, d, h\}$ in $G_2$
and $\{e\}$ in $G_3$. It is easy to see that $G_2$ preserves all distances in
$G$; we shall explain the addition of edge $(e, h)$ later. In order to
preserve the distance in $G_2$, an edge $(e, g)$ of weight 2 is added to
$G_3$. $G_4$ consists of a single edge $(a, g)$ of weight 3, and $L_4 = \{a\}$.
$G_5$ consists of a single vertex $g$, and $L_5 = \{g\}$.




Figure 1: A vertex hierarchy (recoverable panel titles: (c) $L_2 = \{b, d, h\}$; (d) $G_3$, $L_3 = \{e\}$; (e) $G_4$, $L_4 = \{a\}$; (f) $G_5$)

The distance preservation property can be maintained in $G_i$ with
respect to $G_{i-1}$ as follows. First, we require the subgraph of $G_{i-1}$
induced by the vertex set $V_{G_i}$ to be in $G_i$ (i.e., $(u, v) \in E_{G_i}$ iff
$(u, v) \in E_{G_{i-1}}$ for $u, v \in V_{G_i}$). Then, we create a set of
additional edges, called augmenting edges, to be included into $E_{G_i}$ as
follows. For any vertex $v \in L_{i-1}$ (thus $v \notin V_{G_i}$ according to
Definition 1), if $u, w \in V_{G_i}$, $(u, v) \in E_{G_{i-1}}$ and $(v, w) \in E_{G_{i-1}}$,
then an augmenting edge $(u, w)$ is created in $G_i$ with $\omega_{G_i}(u, w) =
\omega_{G_{i-1}}(u, v) + \omega_{G_{i-1}}(v, w)$. If $(u, w)$ already exists in $G_i$, then
$\omega_{G_i}(u, w) = \min(\omega_{G_{i-1}}(u, w), \omega_{G_{i-1}}(u, v) + \omega_{G_{i-1}}(v, w))$.
An edge in $G_i$ with updated weight is also called an augmenting
edge. For example, in Figure 1, in $G_3$, $dist(e, g)$ can be
preserved by creating an augmenting edge $(e, g)$ with $\omega(e, g) = 2$.
Edge $(e, h)$ is also added according to our process above. Note that
$dist_{G_1}(e, h) = 3$, which can be preserved in $G_2$ without adding
$(e, h)$, but we leave $(e, h)$ there to avoid costly distance querying
needed to exclude $(e, h)$.
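
The construction of $G_i$ from $G_{i-1}$ can be sketched in Python as below. This is an in-memory illustration only, not the I/O-efficient algorithm of Section 6. The function names are hypothetical, and the edge list is an assumption: it is inferred from the weights and labels stated in Examples 1 to 3, since the drawing of Figure 1 is not reproduced here.

```python
def make_graph(edges):
    """Build an undirected adjacency map {u: {v: weight}} from (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def next_level_graph(prev, independent_set):
    """Build G_i from G_{i-1} (`prev`) by removing the independent set L_{i-1}."""
    removed = set(independent_set)
    # Step 1: induced subgraph of G_{i-1} on the remaining vertices.
    nxt = {u: {v: w for v, w in nbrs.items() if v not in removed}
           for u, nbrs in prev.items() if u not in removed}
    # Step 2: self-join on the neighbors of each removed vertex v.
    # Because `removed` is independent, every neighbor of v survives in `nxt`.
    for v in removed:
        nbrs = sorted(prev[v].items())
        for i, (u, w_uv) in enumerate(nbrs):
            for x, w_vx in nbrs[i + 1:]:
                cand = w_uv + w_vx
                if cand < nxt[u].get(x, float("inf")):
                    nxt[u][x] = cand  # new or updated augmenting edge (u, x)
                    nxt[x][u] = cand
    return nxt

# Assumed edge list for the running example (unit weights except (e, f) = 3).
EDGES = [("a", "b", 1), ("a", "e", 1), ("b", "c", 1), ("b", "e", 1),
         ("d", "e", 1), ("d", "g", 1), ("e", "f", 3), ("e", "i", 1),
         ("f", "h", 1), ("g", "h", 1)]
G1 = make_graph(EDGES)
G2 = next_level_graph(G1, {"c", "f", "i"})
G3 = next_level_graph(G2, {"b", "d", "h"})
G4 = next_level_graph(G3, {"e"})
```

Under the assumed edge list, `G2` contains the augmenting edge $(e, h)$ of weight 4, `G3` contains $(e, g)$ of weight 2, and `G4` is the single edge $(a, g)$ of weight 3, in agreement with Example 1.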

The following lemma shows the correctness of constructing Gi 
from Gi-i as discussed above. 

LEMMA 2. Constructing $G_i$ from $G_{i-1}$, where $2 \le i \le h$, by
adding augmenting edges to the induced subgraph of $G_{i-1}$ by $V_{G_i}$
maintains the distance preservation property with respect to $G_{i-1}$.

PROOF. According to Definition 1, $L_{i-1}$ is the only set of
vertices that are in $G_{i-1}$ but missing in $G_i$. For any two
vertices $s$ and $t$ in $G_i$, suppose that the shortest path (in $G_{i-1}$)
from $s$ to $t$, $SP_{G_{i-1}}(s, t)$, does not pass through any vertex in
$L_{i-1}$; then the distance between $s$ and $t$ in $G_{i-1}$ is trivially
preserved in $G_i$. Next suppose $SP_{G_{i-1}}(s, t)$ passes through some
vertex $v \in L_{i-1}$. Let $SP_{G_{i-1}}(s, t) = \langle s, \ldots, u, v, w, \ldots, t \rangle$.
Since $L_{i-1}$ is an independent set of $G_{i-1}$, neither $u$ nor $w$ is in
$L_{i-1}$, so both are in $G_i$.
Then, we must have the augmenting edge $(u, w)$ created in $G_i$
with $\omega_{G_i}(u, w) = \omega_{G_{i-1}}(u, v) + \omega_{G_{i-1}}(v, w)$, or $\omega_{G_i}(u, w) =
\min(\omega_{G_{i-1}}(u, w), \omega_{G_{i-1}}(u, v) + \omega_{G_{i-1}}(v, w))$ if $(u, w)$ already
exists in $G_i$. Therefore, the distance (in $G_{i-1}$) between any two
vertices is preserved in $G_i$. □

In addition to the distance preservation property that is required
for answering distance queries, the proof also gives a hint on why
we require each $L_i$ to be an independent set of $G_i$. Since there is
no edge in $G_{i-1}$ between any two vertices in $L_{i-1}$, to create an
augmenting edge $(u, w)$ in $G_i$ we only need to do a self-join on
the neighbors of the vertex $v \in L_{i-1}$. Thus, the search space is
limited to 2 hops from each vertex. On the contrary, if an edge can
exist between two vertices in $L_{i-1}$, then to preserve the distance
the search space is at least 3 hops from each vertex, which is
significantly larger than the 2-hop search space in practice. This is
crucial for processing a large graph that cannot fit in main memory
as we may need to scan the graph many times to perform the join,
as we will see in Section 6.

4.2 Vertex Labeling 

With the vertex hierarchy $(\mathbb{L}, \mathbb{G})$, we now describe a labeling
scheme that can facilitate fast computation of P2P distance. We
first define the following concepts necessary for the labeling.

• Level number: each vertex $v \in V_G$ is assigned a level
number, denoted by $\ell(v)$, which is defined as $\ell(v) = i$ iff $v \in L_i$.

• Ancestor: a vertex $u \in V_G$ is an ancestor of a vertex $v$ if
there exists a sequence $S = \langle v = w_1, w_2, \ldots, w_p = u \rangle$, such
that $\ell(w_1) < \ell(w_2) < \ldots < \ell(w_p)$, and for $1 \le i < p$,
the edge $(w_i, w_{i+1}) \in E_{G_j}$ where $j = \ell(w_i)$. Note that $v$
is an ancestor of itself. If $u$ is an ancestor of $v$, then $v$ is a
descendant of $u$.

EXAMPLE 2. In our example in Figure 1, the level numbers of
$c$, $f$, $i$ are 1, those of $b$, $d$, $h$ are 2, and that of $e$ is 3. The ancestors of
$f$ will be $e$, $h$, $a$, $g$, since $(f, e)$ and $(f, h)$ are in $G_1$, $(h, g)$ is in
$G_2$, and $(e, a)$, $(e, g)$ are in $G_3$. Note that $d$ is not an ancestor
of $f$ since in the path $\langle f, e, d \rangle$, $\ell(e) = 3$ while $\ell(d) = 2$. The
ancestor-descendant relationships are shown in Figure 2(a).
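
The ancestor relation can be computed by following only edges that lead to a strictly higher level. The Python sketch below hard-codes, for each vertex $v$ of the running example, its neighbors in $G_{\ell(v)}$. These tables are an assumption transcribed from Examples 1 to 3 rather than read off the original figure, and the names `LEVEL`, `UP`, and `ancestors` are hypothetical.

```python
# Level number l(v) of every vertex in the running example.
LEVEL = {"c": 1, "f": 1, "i": 1, "b": 2, "d": 2, "h": 2, "e": 3, "a": 4, "g": 5}

# UP[v]: neighbors of v in the graph G_{l(v)} (assumed from the examples).
UP = {
    "c": ["b"], "f": ["e", "h"], "i": ["e"],
    "b": ["a", "e"], "d": ["e", "g"], "h": ["e", "g"],
    "e": ["a", "g"], "a": ["g"], "g": [],
}

def ancestors(v):
    """Return the ancestor set of v, including v itself."""
    seen = {v}
    stack = [v]
    while stack:
        u = stack.pop()
        for w in UP[u]:
            # Only climb: the next vertex in the sequence must have a higher level.
            if LEVEL[w] > LEVEL[u] and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen
```

For instance, `ancestors("f")` yields $\{f, e, h, a, g\}$ and does not contain $d$, as stated in Example 2.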

We now define vertex label as follows. 

Definition 2 (Vertex Label). The label of a vertex $v \in V_G$,
denoted by $LABEL(v)$, is defined as $LABEL(v) =
\{(u, dist_G(v, u)) : u \in V_G \text{ is an ancestor of } v\}$.

To compute $LABEL(v)$ for all $v \in V_G$, we need to compute
the distance from $v$ to each of $v$'s ancestors. This is an expensive
process which cannot be scaled to process large graphs. To address
this problem, we define a relaxed vertex label that requires only an
upper-bound, $d(v, u)$, of $dist_G(v, u)$ and show that $d(v, u)$ suffices
for answering distance queries.

Definition 3 (Relaxed Vertex Label). The relaxed
label of a vertex $v \in V_G$, denoted by $label(v)$, is a set of
"$(u, d(v, u))$" pairs computed by the following procedure:
For each $v \in V_G$, we first include $(v, 0)$ in $label(v)$ and
mark $v$. Then, we add more entries to $label(v)$ recursively
as follows. Take a marked vertex $u$ that has the smallest
level number $\ell(u)$, and unmark $u$. Let $\ell(u) = j$. For each
$w \in adj_{G_j}(u)$, where $\ell(w) > j$ and $(w, d(v, w)) \notin label(v)$,
add the entry $(w, (d(v, u) + \omega_{G_j}(u, w)))$ to $label(v)$, and mark
$w$. If the entry $(w, d(v, w))$ is already in $label(v)$, update
$d(v, w) = \min(d(v, w), (d(v, u) + \omega_{G_j}(u, w)))$. Repeat the
above recursive process until no more vertex is marked.
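
The procedure of Definition 3 can be sketched with a priority queue keyed on the level number. This is an in-memory Python illustration under stated assumptions: `LEVEL` and the weighted table `UP` (the neighbors of each vertex $v$ in $G_{\ell(v)}$) are transcribed from Examples 1 to 3 rather than from the original figure, and the function name is hypothetical.

```python
import heapq

LEVEL = {"c": 1, "f": 1, "i": 1, "b": 2, "d": 2, "h": 2, "e": 3, "a": 4, "g": 5}

# UP[v][w]: weight of edge (v, w) in G_{l(v)} (assumed from the examples).
UP = {
    "c": {"b": 1}, "f": {"e": 3, "h": 1}, "i": {"e": 1},
    "b": {"a": 1, "e": 1}, "d": {"e": 1, "g": 1}, "h": {"g": 1, "e": 4},
    "e": {"a": 1, "g": 2}, "a": {"g": 3}, "g": {},
}

def relaxed_label(v):
    """Compute label(v) as a dict {u: d(v, u)} following Definition 3."""
    label = {v: 0}
    marked = [(LEVEL[v], v)]  # min-heap: the marked vertex with smallest level first
    while marked:
        _, u = heapq.heappop(marked)  # unmark u
        for w, weight in UP[u].items():
            if LEVEL[w] <= LEVEL[u]:
                continue  # only neighbors at a strictly higher level qualify
            cand = label[u] + weight
            if w not in label:
                label[w] = cand  # new entry: add it and mark w
                heapq.heappush(marked, (LEVEL[w], w))
            elif cand < label[w]:
                label[w] = cand  # existing entry: keep the smaller upper bound
    return label
```

Each vertex is marked once, when it first enters the label. Because vertices are unmarked in ascending level order, any later update to its entry arrives before the vertex itself is unmarked.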

As for $LABEL(v)$, $label(v)$ contains entries for all ancestors
of $v$. In Section 6, we will show that the new definition facilitates
the design of an I/O-efficient algorithm for handling large graphs.
Here, we further illustrate the concept using an example, and then
prove that $label(v)$ can indeed be used instead of $LABEL(v)$ to
correctly answer P2P distance queries in the following subsection.

EXAMPLE 3. For our example in Figure 1, the ancestor
relationships are shown in Figure 2(a), where all edges have unit
weights unless indicated otherwise. The labeling starts with $L_1$,
for vertices $c$, $f$, $i$; next the $L_2$ vertices $b$, $d$, $h$ are labeled, followed by
$L_3 = \{e\}$, $L_4 = \{a\}$, and $L_5 = \{g\}$. Consider the labeling
for vertex $c$. First, $(c, 0)$ is included. Since $adj_{G_1}(c) = \{b\}$, $(b, 1)$
is added to $label(c)$ and $b$ is marked. $b$ is unmarked by checking
its neighbors $a$ and $e$ in $G_2$, and we include both $(a, 2)$ and $(e, 2)$ into
$label(c)$; $a$ and $e$ are marked. $e$ is at level 3 and is unmarked next.
Since $adj_{G_3}(e) = \{a, g\}$, we add $(g, 4)$ to $label(c)$. Then $a$ is unmarked;
its only neighbor $g$ in $G_4$ is already in $label(c)$, and $d(c, g)$ is not
updated; $g$ remains marked. Finally $g$ is unmarked; since $g$ has no neighbor
in $G_5$, no further processing is required. The labels for all vertices
are shown in Figure 2(b). Note that $d(h, e) = 4$ in $label(h)$, while
$dist_G(h, e) = 3$, hence $d(h, e) > dist_G(h, e)$. In general the
distance value in a label entry can be greater than the true distance.




label(c)    {(a,2), (b,1), (c,0), (e,2), (g,4)}
label(f)    {(a,4), (e,3), (f,0), (g,5), (h,1)}
label(i)    {(a,2), (e,1), (g,3), (i,0)}
label(b)    {(a,1), (b,0), (e,1), (g,3)}
label(d)    {(a,2), (d,0), (e,1), (g,1)}
label(h)    {(a,5), (e,4), (g,1), (h,0)}
label(e)    {(a,1), (e,0), (g,2)}
label(a)    {(a,0), (g,3)}
label(g)    {(g,0)}

(b)

Figure 2: Labeling for the example in Figure 1

4.3 P2P Distance Querying 

We now discuss how we use the vertex labels to answer P2P 
distance queries. We first define the following label operations used 
in query processing. 

• Vertex extraction: $V[label(v)] = \{u : (u, d(v, u)) \in label(v)\}$.

• Label intersection: $label(u) \cap label(v) = V[label(u)] \cap V[label(v)]$.

The above two operations apply in the same way to $LABEL(.)$.
Given a P2P distance query with two input vertices, $s$ and $t$, let
$X = label(s) \cap label(t)$; the query answer is given as follows.

$$dist_G(s, t) = \begin{cases} \min_{w \in X} \{ d(s, w) + d(w, t) \} & \text{if } X \neq \emptyset \\ \infty & \text{if } X = \emptyset \end{cases} \qquad (1)$$

In Equation 1, we retrieve $d(s, w)$ and $d(t, w)$ for each $w \in X$
from $label(s)$ and $label(t)$, respectively. We give an example of
answering P2P distance queries using the vertex labels as follows.



EXAMPLE 4. Consider the example in Figure 1; the labeling is
shown in Figure 2. Suppose we are interested in $dist_G(h, e)$. We
look up $label(h)$ and $label(e)$: $label(h) \cap label(e) = \{e, a, g\}$.
Among these vertices, $g$ has the smallest sum of $d(h, g) + d(g, e)
= 1 + 2 = 3$. Hence we return 3 as $dist_G(h, e)$. Note that although
the distance $d(h, e)$ recorded in $label(h)$ is 4, which is greater than
$dist_G(h, e)$, the correct distance is returned. If we want to find
$dist_G(a, g)$, $label(a) \cap label(g) = \{g\}$. Hence $dist_G(a, g)$ is
given by $d(a, g) + d(g, g) = 3 + 0 = 3$.
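
Equation 1 translates directly into a few lines of Python once each label is held as a dictionary from ancestor to upper-bound distance. The sketch below is illustrative: the function name and the in-memory dictionary layout are assumptions, while the label contents are copied from Figure 2(b).

```python
def query_distance(label_s, label_t):
    """Evaluate Equation 1 on two labels given as dicts {w: d(., w)}."""
    common = label_s.keys() & label_t.keys()  # X = label(s) ∩ label(t)
    if not common:
        return float("inf")  # s and t share no ancestor: not connected
    return min(label_s[w] + label_t[w] for w in common)

# Labels of h, e, a, and g as listed in Figure 2(b).
LABELS = {
    "h": {"a": 5, "e": 4, "g": 1, "h": 0},
    "e": {"a": 1, "e": 0, "g": 2},
    "a": {"a": 0, "g": 3},
    "g": {"g": 0},
}
```

With these labels, `query_distance(LABELS["h"], LABELS["e"])` evaluates the candidates $a$ (5 + 1), $e$ (4 + 0), and $g$ (1 + 2) and returns 3, matching Example 4.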

Query processing using the vertex labels is simple; however, it 
is not straightforward to see how the answer obtained is correct 
for every query. In the remainder of this section, we prove the 
correctness of the query answer obtained using the vertex labels. 

We first define the concept of max-level vertex, denoted by
$v_{max}$, of a shortest path, which is useful in our proofs. Given a
shortest path from $s$ to $t$ in $G$, $SP_G(s, t) = \langle s = v_1, v_2, \ldots, v_p =
t \rangle$, $v_{max}$ is the max-level vertex of $SP_G(s, t)$ if $v_{max}$ is a vertex
on $SP_G(s, t)$ and $\ell(v_{max}) \ge \ell(v_i)$ for $1 \le i \le p$. The following
lemma shows that $v_{max}$ is unique in any shortest path.

LEMMA 3. Given two vertices $s$ and $t$, if $SP_G(s, t)$ exists, then
there exists a unique max-level vertex, $v_{max}$, of $SP_G(s, t)$.

PROOF. First, since $SP_G(s, t)$ exists, $v_{max}$ must exist on
$SP_G(s, t)$. Now suppose to the contrary that $v_{max}$ is not unique,
i.e., there exists at least one other vertex $v$ on $SP_G(s, t)$ such that
$\ell(v_{max}) = \ell(v) = j$, which also means that both $v_{max}$ and $v$
are in $L_j$ and $G_j$. Since $L_j$ is an independent set of $G_j$, there is
no edge between $v_{max}$ and $v$ in $G_j$. Since $v_{max}$ and $v$ are on the
same path $SP_G(s, t)$, they must be connected in $G_j$ and the path
connecting them must pass through some neighbor $u$ of $v_{max}$ or
$v$ in $G_j$, where $u$ is also on $SP_G(s, t)$. Thus, $u$ cannot be in $L_j$
(otherwise the vertex independence property is violated) and hence
$\ell(u) > \ell(v_{max})$, which contradicts that $v_{max}$ is the max-level
vertex of $SP_G(s, t)$. □

Next we prove that LABEL(.) can be used to correctly answer 
P2P distance queries. Then, we show how label(.) possesses the 
essential information of LABEL(.) for the processing of distance 
queries. 

THEOREM 1. Given a P2P distance query with two input
vertices, $s$ and $t$, let $X = LABEL(s) \cap LABEL(t)$, then
$dist_G(s, t) = \min_{w \in X} \{dist_G(s, w) + dist_G(t, w)\}$ if $X \neq \emptyset$,
or $dist_G(s, t) = \infty$ if $X = \emptyset$.

PROOF. We first show that if $SP_G(s, t)$ exists, then $v_{max} \in X$.
Consider a sequence of vertices, $S = \langle s = u_1, u_2, \ldots, u_\alpha =
v_{max} = v_\beta, \ldots, v_2, v_1 = t \rangle$, extracted from $SP_G(s, t)$, such that
$\ell(u_1) < \ell(u_2) < \ldots < \ell(u_\alpha) = \ell(v_{max})$, $\ell(v_1) < \ell(v_2) < \ldots <
\ell(v_\beta) = \ell(v_{max})$, and for $1 \le i < \alpha$, any vertex $w$ between $u_i$
and $u_{i+1}$ on $SP_G(s, t)$ has $\ell(w) < \ell(u_i)$, and same for any vertex
between $v_i$ and $v_{i+1}$. Note that since $u_{i+1}$ is the next vertex after $u_i$
with $\ell(u_{i+1}) > \ell(u_i)$, we have $\ell(w) \le \ell(u_i)$, and $\ell(w) \neq \ell(u_i)$
by the vertex independence property.

Since $u_i$ and $u_{i+1}$ are connected, they must exist together in
$G_{\ell(u_i)}$. Since there exists no other vertex $w$ between $u_i$ and $u_{i+1}$
on $SP_G(s, t)$ such that $\ell(w) \ge \ell(u_i)$, $u_i$ and $u_{i+1}$ are not
connected by any such $w$ in $G_{\ell(u_i)}$. Thus, by Lemma 1, the edge
$(u_i, u_{i+1})$ must exist in $G_{\ell(u_i)}$ for $G_{\ell(u_i)}$ to preserve the distance
between $u_i$ and $u_{i+1}$, which means that for $1 \le j \le \alpha$, $u_j$ is
an ancestor of $s$ and hence $u_j \in LABEL(s)$. Note that $u_1 =
s \in LABEL(s)$ if $\alpha = 1$. Similarly, we have $v_j \in LABEL(t)$,
for $1 \le j \le \beta$. Thus, $v_{max} = u_\alpha = v_\beta \in X$ and hence
$dist_G(s, t) = dist_G(s, v_{max}) + dist_G(t, v_{max})$.

The other case is that $SP_G(s, t)$ does not exist, i.e., $s$ and $t$ are
not connected, and we want to show that $X = \emptyset$. Suppose on the
contrary that there exists $w \in X$. Then, it means that there is a path
from $s$ to $w$ and from $t$ to $w$, implying that $s$ and $t$ are connected,
which is a contradiction. Thus, $X = \emptyset$ and $dist_G(s, t) = \infty$ is
correctly computed. □

Theorem 1 reveals two pieces of information that are essential
for answering distance queries: the ancestor set and the distance to 
the ancestors maintained in LABEL(.). We first show that label(.) 
also encodes the same ancestor set of LABEL(.). 

LEMMA 4. For each $v \in V_G$, $V[label(v)] = V[LABEL(v)]$.

PROOF. First, we show that if $w \in V[LABEL(v)]$, i.e., $w$ is an
ancestor of $v$, then $w \in V[label(v)]$. According to the definition
of ancestor, there exists a sequence $S = \langle v = w_1, w_2, \ldots, w_p =
w \rangle$, such that $\ell(w_1) < \ell(w_2) < \ldots < \ell(w_p)$, and for $1 \le i <
p$, $(w_i, w_{i+1}) \in E_{G_{\ell(w_i)}}$. This definition implies that if $w_i$ is
currently in $V[label(v)]$, $w_{i+1}$ will also be added to $V[label(v)]$
according to Definition 3. Since $w_1 = v$ must be in $V[label(v)]$, it
follows that $w = w_p$ is also in $V[label(v)]$.

Next, we show that if $w \in V[label(v)]$, then $w \in
V[LABEL(v)]$. First, we have $v \in V[label(v)]$, and $v$ is also in
$V[LABEL(v)]$. Then, according to Definition 3, a vertex $w$ is
added to $V[label(v)]$ only if $w \in adj_{G_{\ell(u)}}(u)$ for some $u$
currently in $V[label(v)]$, and $\ell(w) > \ell(u)$, and since $u$ is an
ancestor of $v$, it implies that $w$ is an ancestor of $v$ and hence $w \in
V[LABEL(v)]$. □

Next, we show that label(.) also possesses the essential distance 
information for correct computation of P2P distance. 

LEMMA 5. Given a P2P distance query, $s$ and $t$, let $X =
label(s) \cap label(t)$. If $SP_G(s, t)$ exists, then $v_{max} \in X$,
$d(s, v_{max}) = dist_G(s, v_{max})$ and $d(t, v_{max}) = dist_G(t, v_{max})$.

PROOF. It follows from Lemma 4 that $label(s) \cap label(t) =
LABEL(s) \cap LABEL(t)$. As the proof of Theorem 1 shows that
$v_{max} \in LABEL(s) \cap LABEL(t)$, we also have $v_{max} \in X$.

The proof of Theorem 1 defines a sequence, $S = \langle s =
u_1, u_2, \ldots, u_\alpha = v_{max} = v_\beta, \ldots, v_2, v_1 = t \rangle$, extracted from
$SP_G(s, t)$. In particular, the proof shows that the edge $(u_i, u_{i+1})$
exists in $G_{\ell(u_i)}$ and $\ell(u_{i+1}) > \ell(u_i)$, for $1 \le i < \alpha$. Thus,
according to Definition 3, we add the entry $(u_{i+1}, (d(s, u_i) +
\omega_{G_{\ell(u_i)}}(u_i, u_{i+1})))$ to $label(s)$. Since each $\omega_{G_{\ell(u_i)}}(u_i, u_{i+1})$
preserves the distance between $u_i$ and $u_{i+1}$, and $d(s, u_1) =
dist_G(s, u_1)$, it follows that $d(s, v_{max} = u_\alpha) = dist_G(s, v_{max} =
u_\alpha)$. Similarly, we have $d(t, v_{max}) = dist_G(t, v_{max})$. □

Finally, the following theorem states the correctness of query 
processing using label(.). 

THEOREM 2. Given a P2P distance query, $s$ and $t$, $dist_G(s, t)$
evaluated by Equation 1 is correct.

PROOF. The proof follows directly from Theorem 1 and Lemmas 4
and 5. □

5. A K-LEVEL VERTEX HIERARCHY 

In Definition 1, we do not limit the height $h$ of the vertex
hierarchy, i.e., the number of levels in the hierarchy. This definition
ensures that an independent set $L_i$ can always be obtained for each
$G_i$, for $1 \le i \le h$. However, there are two problems associated
with the height of the vertex hierarchy. First, as the number of
levels $h$ increases, the label size of the vertices at the lower levels (i.e.,
vertices with a smaller level number) also increases. Since vertex
labels require storage space and are directly related to query
processing, there is a need to limit the vertex label size. Second, as we
will discuss in Section 6, the complexity of constructing the
vertex hierarchy is linear in $h$. Thus, reducing $h$ can also improve the
efficiency of index construction.

In this section, we propose to limit the height $h$ by a $k$-level
vertex hierarchy, where $k$ is normally much smaller than $h$, and
discuss how the above-mentioned problems are resolved.

5.1 Limiting the Height of Vertex Hierarchy 

The main idea is to terminate the construction of the vertex
hierarchy earlier, at a level where a certain condition is met. We first
define the $k$-level vertex hierarchy.

Definition 4 (k-level Vertex Hierarchy). Given
a graph $G = (V_G, E_G, \omega_G)$, a vertex hierarchy structure
$\mathbb{H} = (\mathbb{L}, \mathbb{G})$ of $G$, and an integer $k$, where $1 < k \le (h + 1)$ and $h$
is the number of levels in $\mathbb{H}$, a $k$-level vertex hierarchy structure of
$G$ is defined by a pair $(\mathbb{H}_{<k}, G_k)$, where $\mathbb{H}_{<k}$ and $G_k$ are defined
as follows:

• $\mathbb{H}_{<k} = (\mathbb{L}_{<k}, \mathbb{G}_{<k})$ consists of the first $(k - 1)$ levels of $\mathbb{H}$,
i.e., $\mathbb{L}_{<k} = \{L_1, \ldots, L_{k-1}\}$ and $\mathbb{G}_{<k} = \{G_1, \ldots, G_{k-1}\}$;

• $G_k$ is the same $G_k$ as the $G_k$ in $\mathbb{G}$.

The fc-level vertex hierarchy simply takes the first (k — 1) Li £ 
L, for 1 < i < k, and the first k Gi £ G, for 1 < i < k. 
We set the value of k as follows: let i be the first level such that 
(|Gi|/|G<_i|) > cr, where o (0 < a < 1) is a threshold for the 
effect of Gi \ then, k — i. 

If k = (h+1), then M < k is simply H and Gk is an empty graph. 
In practice, a value of o that attains a reasonable indexing cost and 
storage usage will often give k <C h. 
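The selection rule for k can be illustrated with a minimal sketch. It assumes the sizes |G_1|, |G_2|, ... are already available as a list; the function name choose_k and the list layout are hypothetical and only serve the illustration.

```python
def choose_k(sizes, sigma):
    """Return k for a k-level vertex hierarchy.

    sizes[i - 1] holds |G_i| (an assumed layout); the list covers
    G_1 .. G_{h+1}. k is the first level i >= 2 whose size ratio
    |G_i| / |G_{i-1}| exceeds sigma. If no level qualifies, the
    hierarchy is kept whole and k = h + 1 = len(sizes).
    """
    for i in range(2, len(sizes) + 1):
        if sizes[i - 2] > 0 and sizes[i - 1] / sizes[i - 2] > sigma:
            return i
    return len(sizes)
```

With sizes [100, 60, 40, 39, 38] and σ = 0.95, the ratios are 0.60, 0.67 and 0.975, so the sketch stops at level 4.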

For the k-level vertex hierarchy, we assign the level number
ℓ(v) = i for each vertex v ∈ L_i, where 1 ≤ i ≤ (k - 1),
while for each vertex v ∈ V_{G_k}, we assign ℓ(v) = k. In this
way, we can compute label(v) (or LABEL(v)) for each vertex
v ∈ V_G in the same way as discussed in Section 4.2. Note that
label(v) = {(v, 0)} for each vertex v ∈ V_{G_k} since v has the
highest level number among all vertices in V_G.




Figure 3: A k-level vertex hierarchy (k = 2). Panels: (a) G = G_1, L_1 = {c, f, i}; (b) G_2

EXAMPLE 5. Let us consider our running example in Figure 1.
If we set k = 2, there is only one level L_1 in L_{<k}; the graph G_2
is the highest level graph and is not further decomposed. The
k-level vertex hierarchy is shown in Figure 3. The maximum level of
vertices is 2, since all vertices v in G_2 are assigned ℓ(v) = 2. The
labels for the vertices in L_1 are shown in the following table.
5.2 P2P Distance Querying by k-Level Vertex Hierarchy

According to Section 5.1, ℓ(v) and label(v) computed from the
k-level vertex hierarchy may be different from those computed
from the original vertex hierarchy. However, we show later in this
section that these labels are highly useful, for they capture all
the information that is essential from G - G_k for a continued
distance search in G_k. Given a P2P distance query, s and t, we
process the query according to whether s and t are in G_k. We have
the following two possible types of queries.

Type 1: s ∉ V_{G_k} and t ∉ V_{G_k}, and either
(V[label(s)] ∩ V_{G_k}) = ∅ or (V[label(t)] ∩ V_{G_k}) = ∅. Type 1
queries are evaluated by Equation 1.

Type 2: queries that are not Type 1. Type 2 queries are evaluated
by a label-based bi-Dijkstra search procedure.

We have discussed query processing by Equation 1 in Section 4.3.
We now discuss how we process Type 2 queries as follows.

5.2.1 Label-based bi-Dijkstra Search

We describe a bidirectional Dijkstra's algorithm that utilizes ver- 
tex labels for effective pruning. The algorithm consists of two main 
stages: (1) initialization of distance queues and pruning condition, 
and (2) bidirectional Dijkstra search. 

As shown in Algorithm 1, we first initialize a forward and a
reverse min-priority queue, FQ and RQ, which are to be used for
running Dijkstra's single-source shortest path algorithm from s and
t, respectively. For any vertex v ∈ V_{G_k}, if (v, d(s, v)) ∈ label(s),
we add (v, d(s, v)) to FQ with d(s, v) as the key. For all other
vertices in V_{G_k} but not in label(s), we add the record (v, ∞) to
FQ. Similarly, we initialize RQ.

The vertex labels can also be used for pruning the search space.
If there exists a path between s and t that passes through some
vertex w ∈ (V_G - V_{G_k} - {s, t}), then Lines 5-6 initialize μ as
the minimum length of such a path. Note that μ ≥ dist_G(s, t).

We now describe Stage 2 of the query processing. We run
Dijkstra's algorithm simultaneously from s and t by extracting the
vertex v with the minimum key from FQ or RQ (Line 9). Let
(v, d(x, v)) be the extracted record, where x = s if the record
is extracted from FQ and x = t otherwise. At this point, Dijkstra's
algorithm guarantees that the distance from x to v is found, i.e.,
d(x, v) = dist_G(x, v). Then, in Lines 13-18, the distance from
x to every neighbor u of v in G_k is updated, if u is still in FQ (if
x = s) or RQ (if x = t).

In addition to starting the search in both directions from s and
t in Dijkstra's algorithm, we also add a pruning condition in Line
8 that requires the sum of the minimum keys of FQ and RQ to be
less than μ. If this sum is not less than μ, then no path from s to t
of a shorter distance than μ can be found (proved in Theorem 4),
and hence we return dist_G(s, t) = μ.

To improve the pruning effect so that the search converges
quickly, we keep updating μ whenever d(x, u) is updated, if
dist_G(x', u) has been found (Lines 17-18), since u is a potential
vertex on SP_G(s, t). We use a set S to keep the vertices whose
distance from s or t has been found. Whenever dist_G(x, v)
is found for a vertex v, if v is not yet in S, we insert v, together
with dist_G(x, v), into S.
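The two stages described above can be sketched in memory as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: labels are assumed to be dictionaries mapping a vertex to its label distance, G_k is assumed to be an undirected adjacency dictionary, vertex identifiers are assumed to be mutually comparable, and the explicit (v, ∞) queue entries are replaced by lazy insertion into Python's heapq. The name bi_dijkstra is hypothetical.

```python
import heapq
import math


def bi_dijkstra(label_s, label_t, gk):
    """Label-based bidirectional Dijkstra search on G_k.

    label_s, label_t: dict vertex -> distance (assumed label layout).
    gk: dict vertex -> dict neighbor -> weight (undirected G_k).
    Returns the s-t distance, or math.inf if no path is found.
    """
    # Stage 1: mu starts as the best distance through a common label vertex.
    common = label_s.keys() & label_t.keys()
    mu = min((label_s[w] + label_t[w] for w in common), default=math.inf)
    # Side 0 searches from s, side 1 from t; seeds are label vertices in G_k.
    dist = [{v: d for v, d in lab.items() if v in gk}
            for lab in (label_s, label_t)]
    heaps = [[(d, v) for v, d in side.items()] for side in dist]
    for heap in heaps:
        heapq.heapify(heap)
    settled = [{}, {}]  # plays the role of the set S, kept per direction

    def min_key(side):
        # Drop stale heap entries, then report the smallest live key.
        heap = heaps[side]
        while heap and (heap[0][1] in settled[side]
                        or heap[0][0] > dist[side][heap[0][1]]):
            heapq.heappop(heap)
        return heap[0][0] if heap else math.inf

    # Stage 2: alternate between the two searches until pruning applies.
    while True:
        key_f, key_r = min_key(0), min_key(1)
        if key_f + key_r >= mu:
            return mu
        side = 0 if key_f <= key_r else 1
        d_v, v = heapq.heappop(heaps[side])
        settled[side][v] = d_v
        for u, weight in gk[v].items():
            if u in settled[side]:
                continue
            cand = d_v + weight
            if cand < dist[side].get(u, math.inf):
                dist[side][u] = cand
                heapq.heappush(heaps[side], (cand, u))
            if u in settled[1 - side]:
                # u is settled in the other direction: a full s-t path exists.
                mu = min(mu, dist[side][u] + settled[1 - side][u])
```

One design difference from the pseudocode is worth noting: the sketch tests whether u is settled in the opposite direction on every relaxation attempt, not only when d(x, u) decreases, which can only tighten μ.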

We give an example to illustrate how queries are processed as 
follows. 

EXAMPLE 6. Let us consider Example 5. Suppose we need to
process a distance query between vertices c and i, i.e., s = c, t = i.
In label(c), b is in G_k, and therefore we enter (b, d(c, b) = 1)
into FQ. In label(i), e is in G_k, hence we enter (e, d(i, e) = 1)
into RQ. label(c) ∩ label(i) = ∅, hence μ = ∞ after Stage 1 of
Algorithm 1. In Stage 2, let us extract (b, 1) from FQ first;
(b, 1) is inserted into S, and we enter (a, 2), (e, 2) into FQ. Next
we extract (e, 1) from RQ, and insert (e, 1) into S. (a, 2), (d, 2),
(b, 2) are entered into RQ. Since b is in S, we update μ to 2 + 1
= 3. At this point (min(FQ) + min(RQ)) ≥ μ and we return
dist_G(c, i) = 3.

Algorithm 1: Label-based bi-Dijkstra Search

Input  : s, t, label(s), label(t), G_k
Output : dist_G(s, t)

   // Stage 1: initialization of distance queues and pruning condition
   // FQ (RQ): forward (reverse) min-priority queue
 1 initialize FQ with the set {(v, d(s, v)) : v ∈ V_{G_k},
   (v, d(s, v)) ∈ label(s)}, with d(s, v) as the key;
 2 initialize RQ with the set {(v, d(t, v)) : v ∈ V_{G_k},
   (v, d(t, v)) ∈ label(t)}, with d(t, v) as the key;
 3 ∀v ∈ V_{G_k} and v not in FQ (RQ), insert (v, ∞) into FQ (RQ);
   // μ: shortest distance from s to t found so far
   // μ is used for pruning in Stage 2
 4 μ ← ∞;
 5 X ← label(s) ∩ label(t);
 6 if X ≠ ∅ then μ ← min_{w ∈ X} {d(s, w) + d(w, t)};
 7 S ← ∅;
   // Stage 2: bidirectional Dijkstra search
 8 while both FQ and RQ are not empty, and (min(FQ) + min(RQ)) < μ do
 9     (v, d(x, v)) ← extract-min(FQ, RQ);   // x = s or x = t
10     let x' = t if x = s, and x' = s if x = t;
11     if (v, dist_G(x, v)) is not in S then
12         insert (v, dist_G(x, v)) into S;
13     foreach u ∈ adj_{G_k}(v) do
14         if d(x, u) > d(x, v) + ω_{G_k}(v, u) then
15             d(x, u) ← d(x, v) + ω_{G_k}(v, u);
16             update (u, d(x, u)) in FQ (if x = s) or RQ (if x = t);
17             if (u, dist_G(x', u)) is in S then
18                 μ ← min{μ, d(x, u) + dist_G(x', u)};
19 return μ;

5.2.2 Correctness 

We now prove the correctness of query processing by the k-level
vertex hierarchy. We first prove the correctness for processing Type
1 queries.

THEOREM 3. Given a P2P distance query, s and t, if the query
belongs to Type 1, then dist_G(s, t) evaluated by Equation 1 is
correct.

PROOF. First, we show that if the query belongs to Type 1, then
SP_G(s, t) does not contain any vertex in V_{G_k}. Suppose on the
contrary that SP_G(s, t) contains a vertex in V_{G_k}. Then, consider
the sub-path of SP_G(s, t) from s to x, where x is the only vertex
on the sub-path that is in V_{G_k}. Since SP_G(s, t) is a shortest
path in G, this sub-path is a shortest path from s to x in G. Let
SP_G(s, x) be the sub-path. Consider the query with two input
vertices s and x; then, by similar argument as in the proof of
Lemma 3, we have v_max = x on SP_G(s, x), and by similar argument
as in the proof of Lemma 5, we have x = v_max ∈ V[label(s)]. A
symmetric analysis on the sub-path from t to some vertex y, where y
is the only vertex on the sub-path that is in V_{G_k}, shows that
y = v_max on SP_G(t, y) and y ∈ V[label(t)]. This contradicts the
definition of Type 1 query that either (V[label(s)] ∩ V_{G_k}) = ∅
or (V[label(t)] ∩ V_{G_k}) = ∅.

Now if SP_G(s, t) does not contain any vertex in V_{G_k}, then the
query can be answered using only label entries of vertices from the
first (k - 1) levels of the vertex hierarchy. These entries will have
identical occurrences and contents in the vertex labels at the first k
levels of any vertex hierarchy H_{<j}, where k ≤ j ≤ h + 1, which
is formed by limiting the height of a given H. Thus, the correctness
of query answer follows from Theorem 2. □

Note that Type 1 queries exist only if there exist more than one 
connected component in G such that all vertices in some connected 
component(s) have a level number lower than k. 

Next we prove the correctness for processing Type 2 queries. 

THEOREM 4. Given a P2P distance query, s and t, if the query
belongs to Type 2, then dist_G(s, t) evaluated by the label-based
bi-Dijkstra search procedure is correct.

PROOF. We have two cases: (1) SP_G(s, t) does not contain any
vertex in V_{G_k}, or (2) otherwise.

If SP_G(s, t) does not contain any vertex in V_{G_k}, then
dist_G(s, t) is computed in Lines 5-6 of Algorithm 1, or in other
words by Equation 1. As explained in the proof of Theorem 3, the
correctness of query answer follows from Theorem 2.

If SP_G(s, t) contains at least one vertex in V_{G_k}, then consider
the two subpaths, SP_G(s, x) and SP_G(t, y), defined in the proof of
Theorem 3 (note that it is possible that s = x and/or x = y and/or
y = t). dist_G(s, x) and dist_G(t, y) can be answered using only
label entries of vertices in L_{<k} and their ancestors in G_k for
(H_{<k}, G_k). From the labeling mechanism, the occurrences and
contents of such label entries will be identical in the labels of
vertices in the first k levels of any vertex hierarchy H_{<j},
k ≤ j ≤ h + 1, which is formed by limiting the height of a given H.
Hence by Theorem 2, dist_G(s, x) and dist_G(t, y) are correctly
initialized in Lines 1-3 of Algorithm 1. Thus, if we do not consider
the pruning condition in Line 8, then Dijkstra's algorithm guarantees
that the distance from s (and t) to any vertex in G_k is correctly
computed, from which we can obtain dist_G(s, t).

Now we consider query processing with pruning. Let μ = μ*,
and min_f = min(FQ) and min_r = min(RQ), when the search
stops. If μ* is the value of μ initialized in Line 6, then we
must have x = y ∈ (label(s) ∩ label(t)) and hence μ* =
(dist_G(s, x) + dist_G(t, x)). Otherwise, μ* is a value assigned
to μ in Line 18, and suppose to the contrary that there exists a
shorter path between s and t with length p such that p < μ*. Since
the path passes through vertices in G_k, there must exist an edge
(v, u) in G_k such that p = dist_G(s, v) + ω_{G_k}(v, u) + dist_G(u, t),
dist_G(s, v) < min_f and dist_G(u, t) < min_r. The existence of
this edge is guaranteed because p < μ* ≤ (min_f + min_r). Since
dist_G(s, v) < min_f and dist_G(u, t) < min_r, by Dijkstra's
algorithm, both dist_G(s, v) and dist_G(t, u) have been computed when
the search stops. Thus, μ should have been updated to a value not
greater than p in Line 18 when the edge (v, u) was processed. This
contradicts our assumption and hence μ* = dist_G(s, t). □

6. ALGORITHMS 

In this section, we present the algorithms for index construction 
(i.e., vertex hierarchy construction and vertex labeling) and query 
processing using the vertex labels. In recent years, due to the pro- 
liferation of many massive real world networks, there has been an 
increasing interest in algorithms that handle large graphs. For pro- 
cessing large graphs that cannot fit in main memory, I/O cost usu- 
ally dominates. Thus, we propose I/O-efficient algorithms, from 
which the in-memory algorithms can also be easily devised. 



For the analysis of the I/O complexity in this section, we define
the following notation [4]. Let scan(N) = Θ(N/B) and
sort(N) = Θ((N/B) log_{M/B} (N/B)), where N is the amount of data
being read or written from/to disk, M is the main memory size, and
B is the disk block size (1 ≪ B ≤ M/2).



6.1 Algorithm for Index Construction 

Although the vertex hierarchy, except G_k, is not required for
query processing, it is needed for vertex labeling. There are two
components, L and G, in the vertex hierarchy; thus, we have the
following two main steps: (1) computing each independent vertex
set L_i ∈ L, and (2) constructing each distance-preserving graph
G_i ∈ G. We first describe these two steps, followed by the
construction of the overall vertex hierarchy, and finally the vertex
labeling.



Algorithm 2: Constructing L_i

Input  : A graph G_i = (V_{G_i}, E_{G_i}, ω_{G_i})
Output : L_i and ADJ(L_i) = {adj_{G_i}(v) : v ∈ L_i}

 1 allocate a buffer for L_i and ADJ(L_i), and a buffer for L';
 2 G'_i ← G_i;
 3 sort adj_{G'_i}(v) in G'_i in ascending order of deg_{G'_i}(v);
 4 foreach adj_{G'_i}(u) read in G'_i do
 5     if u ∉ L' then
 6         insert u into L_i, and insert adj_{G'_i}(u) into ADJ(L_i);
 7         foreach v ∈ adj_{G'_i}(u) do
 8             if v ∉ L' then insert v into L';
 9     if buffer for L_i and ADJ(L_i) is full then flush the buffer;
10     if buffer for L' is full then
11         scan G'_i to delete all v ∈ L' and adj_{G'_i}(v), and clear L';


6.1.1 Constructing L_i

We want to maximize the size of each Li as this helps to mini- 
mize the number of levels h and hence also minimizes the vertex 
label size. However, maximizing Li means computing the maxi- 
mum independent set of Gi, which is an NP-hard problem. 

We adopt a greedy strategy to approximate the maximum
independent set of G_i by selecting the vertex with minimum degree
at each step [16], since small degree vertices have a smaller number
of dependent (i.e., adjacent) vertices and hence more vertices are
left as candidates for the independent set at the next step. Moreover,
the greedy algorithm can also be easily extended to give an
I/O-efficient algorithm that handles the case when G_i is too large to
fit in main memory, as described in Algorithm 2.

The algorithm computes an independent set L_i of G_i, together
with the adjacency lists of the vertices in L_i, denoted by ADJ(L_i).
We use ADJ(L_i) to construct G_{i+1} in Section 6.1.2. To compute
L_i, we also keep those vertices that have been excluded from L_i in
the algorithm, as denoted by L'. We use a buffer to keep the current
L_i and ADJ(L_i), and another buffer to keep L'.

The algorithm first makes a copy of G_i, let it be G'_i, and then
sorts the adjacency lists in G'_i in ascending order of the vertex
degrees (i.e., the sizes of the adjacency lists). Then, we read G'_i in
this sorted order, i.e., the adjacency lists of vertices with smaller
degrees are read first. For each adj_{G'_i}(u) read, if u is not in L',
we include u into L_i and add adj_{G'_i}(u) to ADJ(L_i). Meanwhile,
we exclude all vertices in adj_{G'_i}(u) from L_i because of their
dependence on u, i.e., we add these vertices to L'. The algorithm
terminates when adj_{G'_i}(u) for all u in G'_i are read.

If G_i is very large, it is possible that L_i and ADJ(L_i) are too
large to be kept by a memory buffer. We can simply write the current
L_i and ADJ(L_i) in the buffer to disk, and then clear the buffer
for new contents of L_i and ADJ(L_i). However, when the buffer
for L' is full, we cannot simply flush the buffer since it is possible
that ∃u ∈ L' such that adj_{G'_i}(u) has not been read yet. To tackle
this without incurring random disk accesses, we scan G'_i to remove
all the vertices currently in L', together with their adjacency lists,
from G'_i, because these vertices have already been excluded from L_i.
Then, we clear the buffer for L'.

If G'_i can be resident in main memory, Lines 10-11 of Algorithm
2 are not necessary and we only need to scan G'_i once. If G'_i is
resident on disk, it is easy to see that only sequential scans of G'_i
are needed and expensive random disk access is avoided.

Algorithm 2 takes sort(|G_i|) I/Os to sort G_i. If |L'| < M, we
need another scan(|G_i|) I/Os to read G_i. Otherwise, O(|L'|/M) *
scan(|G_i|) I/Os are required.
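For a graph that fits in main memory, the greedy selection reduces to a few lines. The sketch below is an in-memory illustration only (no buffers, flushing or disk scans); the function name greedy_independent_set and the adjacency-dictionary layout are assumptions made for the example, and vertex identifiers are assumed to be comparable so that ties in degree are broken deterministically.

```python
def greedy_independent_set(graph):
    """Greedy independent set by ascending (static) vertex degree.

    graph: dict vertex -> dict neighbor -> weight (undirected).
    Returns (L_i as a list, ADJ(L_i) as a dict of adjacency dicts).
    """
    excluded = set()            # plays the role of L'
    independent = []            # plays the role of L_i
    adj_of_independent = {}     # plays the role of ADJ(L_i)
    # Read adjacency lists in ascending order of degree, ties by vertex ID.
    for u in sorted(graph, key=lambda v: (len(graph[v]), v)):
        if u in excluded:
            continue
        independent.append(u)
        adj_of_independent[u] = dict(graph[u])
        # Neighbors of a selected vertex can no longer be selected.
        excluded.update(graph[u])
    return independent, adj_of_independent
```

On the path a-b-c, the endpoints a and c have degree 1 and are selected, while b is excluded as a neighbor of a.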



Algorithm 3: Constructing G_i

Input  : G_{i-1}, L_{i-1} and ADJ(L_{i-1})
Output : G_i

 1 G_i ← G_{i-1};
 2 remove from G_i all v ∈ L_{i-1} and adj_{G_{i-1}}(v);
 3 E_A ← ∅;
 4 foreach adj_{G_{i-1}}(v) ∈ ADJ(L_{i-1}) do
 5     foreach u, w ∈ adj_{G_{i-1}}(v), where u < w do
 6         insert into E_A the edges (u, w) and (w, u), with
           ω_{G_i}(u, w) = ω_{G_i}(w, u) =
           (ω_{G_{i-1}}(u, v) + ω_{G_{i-1}}(v, w));
 7 sort the edges in E_A by vertex ID's;
 8 scan E_A and G_i to add each edge (u, w) ∈ E_A to G_i, or update
   ω_{G_i}(u, w) with the smaller weight if (u, w) already exists in G_i;



6.1.2 Constructing Gi 

After obtaining L_{i-1} and ADJ(L_{i-1}), we use them to construct
G_i. As shown in Algorithm 3, we first initialize G_i by removing
the occurrences of all vertices in L_{i-1}, together with their
adjacency lists, from G_{i-1}. However, the resultant G_i may not
satisfy the distance preservation property. As discussed in Section
4.1, the violation of this property can be fixed by the creation of a
set of augmenting edges. We create these augmenting edges from
ADJ(L_{i-1}) as follows.

When a vertex v ∈ L_{i-1}, together with adj_{G_{i-1}}(v), is removed
from G_{i-1} to form G_i, what is missing in G_i is the path (u, v, w)
for any u, w ∈ adj_{G_{i-1}}(v), where u < w (i.e., u is ordered
before w). Thus, to preserve the distance we only need to create the
augmenting edge (u, w), and symmetrically (w, u) for undirected
graphs, with weight (ω_{G_{i-1}}(u, v) + ω_{G_{i-1}}(v, w)).

We create all such augmenting edges in Lines 4-6 of Algorithm 3
and store them in an array E_A. Then, we sort the edges in E_A first
in ascending order of the first vertex and then of the second vertex.
Then, we scan both E_A and G_i (already sorted in its adjacency list
representation), so that each edge in E_A is merged into G_i. If an
edge in E_A is already in G_i, then its weight is updated to the
smaller of its weights recorded in E_A and in G_i.

If main memory is not sufficient, Line 2 of Algorithm 3 uses
O(|L_{i-1}|/M) * scan(|G_{i-1}|) I/Os, Lines 3-6 and 8 use scan(|G_i|)
I/Os, and Line 7 uses sort(|G_i|) I/Os, since |E_A| < |G_i|.
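The same construction can be sketched for the in-memory case, where the sort-and-merge of augmenting edges is replaced by direct dictionary updates. The function name build_next_graph and the adjacency-dictionary layout are assumptions for this example; the input set is assumed to be independent, so every neighbor of a removed vertex survives in the new graph.

```python
from itertools import combinations


def build_next_graph(graph, independent):
    """Build G_i from G_{i-1} by removing an independent set L_{i-1}.

    graph: dict vertex -> dict neighbor -> weight (undirected G_{i-1}).
    independent: iterable of vertices forming an independent set.
    """
    removed = set(independent)
    # Copy G_{i-1} without the removed vertices and their incident edges.
    nxt = {v: {u: w for u, w in nbrs.items() if u not in removed}
           for v, nbrs in graph.items() if v not in removed}
    # Add an augmenting edge (u, w) for every path (u, v, w) through a
    # removed vertex v, keeping the smaller weight if the edge exists.
    for v in removed:
        for u, w in combinations(graph[v], 2):
            weight = graph[v][u] + graph[v][w]
            if weight < nxt[u].get(w, float("inf")):
                nxt[u][w] = weight
                nxt[w][u] = weight
    return nxt
```

Removing b from the path a-b-c with weights 1 and 2 leaves a single augmenting edge (a, c) of weight 3, so the distance between a and c is preserved.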

6.1.3 Constructing (L, G)

The overall scheme to construct the vertex hierarchy, (L, G), is
to start with the given G_1 = G, and keep repeating the two steps
of computing L_i (Algorithm 2) and constructing G_i (Algorithm 3)
until we reach a level k (see Section 5.1 for the value of k).

Algorithm 4: Top-Down Vertex Labeling

Input  : (L, G)
Output : label(v), ∀v ∈ V_G

   // Initialization of vertex labels
 1 for i = 1, ..., k - 1 do
 2     foreach v ∈ L_i do
 3         label(v) ← {(v, 0)} ∪ {(u, ω_{G_i}(v, u)) : u ∈ adj_{G_i}(v)};
 4 ∀v ∈ V_{G_k} : label(v) ← {(v, 0)};
   // Top-down vertex labeling
 5 for i = k - 1, ..., 1 do
 6     allocate buffer B_L and load label(v), for each v ∈ L_i, in B_L;
 7     allocate buffer B_U and load label(v), for each v ∈ L_j for
       i < j < k and for each v ∈ V_{G_k}, in B_U;
 8     foreach block B_L do
 9         foreach block B_U do
10             foreach label(v) in B_L do
11                 foreach label(u) in B_U do
12                     if (u, d(v, u)) ∈ label(v) then
13                         foreach (w, d(u, w)) ∈ label(u) do
14                             if (w, d(v, w)) ∉ label(v) then
15                                 add (w, d(v, u) + d(u, w)) to label(v);
16                             else
17                                 d(v, w) ← min(d(v, w), d(v, u) + d(u, w));

6.1.4 Top-Down Vertex Labeling 

Definition 3 essentially defines a procedure for computing
label(v) for each v ∈ V_G. However, a careful analysis will show
that such a procedure, if implemented directly as it is described,
involves much redundant processing, as implied by the following
corollary of Lemma 4.

COROLLARY 1. Given a vertex v ∈ L_i, we have
V[label(v)] = {v} ∪ (∪_{u ∈ adj_{G_i}(v)} V[label(u)]).

PROOF. By Definition 3, ∀u ∈ adj_{G_i}(v), u will be included
into V[label(v)]. From the result of Lemma 4, we have ∀u ∈
V[label(v)], u is an ancestor of v by Definition 2. In the same
way, we have ∀w ∈ V[label(u)], w ∈ V[label(v)] since w is
then also an ancestor of v. Thus, ∀u ∈ adj_{G_i}(v), V[label(u)] ⊆
V[label(v)].

Next, ∀w ∈ V[label(v)]\{v}, w ∈ V[label(u)] for some u ∈
adj_{G_i}(v) because w is included into V[label(v)] from some u by
Definition 3, and by the same procedure w will be included into
V[label(u)] when we compute label(u). □

Corollary 1 implies that label(v) can be computed from
label(u), for each u ∈ adj_{G_i}(v), instead of from scratch. Based
on this, we design a more efficient top-down algorithm for vertex
labeling as shown in Algorithm 4.

The algorithm consists of two stages: initialization of vertex
labels and top-down vertex labeling by block nested loop join,
discussed as follows.

According to Corollary 1, we only need to add (v, 0) and
(u, ω_{G_i}(v, u)) for all u ∈ adj_{G_i}(v) to label(v), and then derive
other entries of label(v) from label(u) in the top-down process.
For each v ∈ V_{G_k}, however, we only need to add (v, 0) to label(v)
since each v ∈ V_{G_k} has only one ancestor, i.e., v itself.

After the initialization, we compute the labels for the vertices
starting from the top levels to the bottom levels, i.e., from level
(k - 1) down to level 1. We assume that the set of labels at each
level may not be able to fit in main memory and hence use block
nested loop join to find the matching labels, i.e., label(u) for each
u ∈ adj_{G_i}(v) when we process v at level i. Note that if
u ∈ adj_{G_i}(v), then (u, d(v, u)) ∈ label(v) by the initialization.
Thus, as shown in Lines 11-16, we derive the entries of other
ancestors of v from label(u) directly, which essentially follows the
rule specified in Definition 3.
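When all labels fit in main memory, the block nested loop join is unnecessary and the top-down pass can be sketched directly. The following is a simplified in-memory illustration, not the I/O-efficient algorithm itself; the function name top_down_labels and the three input structures (levels, the recorded adjacency lists ADJ(L_i), and the vertex set of G_k) are assumed layouts chosen for the example.

```python
def top_down_labels(levels, adjacency, top_vertices):
    """Compute vertex labels from the top level down to level 1.

    levels: list of vertex lists, levels[i] = L_{i+1}, for levels 1..k-1.
    adjacency: dict v -> dict u -> weight, the adjacency list of v in
               the graph G_i from which v was removed (i.e., ADJ(L_i)).
    top_vertices: the vertices of G_k.
    Returns dict v -> dict ancestor -> label distance.
    """
    # Every vertex of G_k has only itself as ancestor.
    labels = {v: {v: 0} for v in top_vertices}
    # Neighbors of v in G_i sit at higher levels, so their labels are
    # already complete when v is processed.
    for level in reversed(levels):
        for v in level:
            label = {v: 0}
            for u, d_vu in adjacency[v].items():
                # labels[u] contains (u, 0), so (u, d_vu) is covered too.
                for w, d_uw in labels[u].items():
                    cand = d_vu + d_uw
                    if cand < label.get(w, float("inf")):
                        label[w] = cand
            labels[v] = label
    return labels
```

For the path a-b-c with weights 1 and 2, L_1 = {a, c} and G_2 = {b}, the sketch yields label(a) = {(a, 0), (b, 1)} and label(c) = {(c, 0), (b, 2)}, so the common entry b gives distance 3 between a and c.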

The complexity of the algorithm is apparently dominated by the
top-down process. Let b_L(i) = |{label(v) : v ∈ L_i}|, and
b_U(i) = |∪_{i<j<k} {label(v) : v ∈ L_j} ∪ {label(v) : v ∈ V_{G_k}}|.
The I/O complexity for the block nested loop join is given by
(b_L(i)/M) * (b_U(i)/B). Thus, the I/O complexity of Algorithm 4
is given by O(Σ_{i=1}^{k-1} ((b_L(i)/M) * (b_U(i)/B))).

6.2 Algorithm for Query Processing 

For processing large datasets, the vertex labels may not fit in 
main memory and are stored on disk. The entries in each label (v) 
are stored sequentially on disk and are sorted by the vertex ID's 
of the ancestors of v. Thus, label(s) n label(t) involves simple 
sequential scanning of the entries in label (s) and label(t). From 
our experiments, the vertex labels are small in size and retrieving a 
vertex label from disk takes only one I/O. The CPU time for query 
processing comes mostly from the bi-Dijkstra search. For a graph
G = (V, E), a binary heap can be used and Dijkstra's algorithm
runs in O((|E| + |V|) log |V|) time.
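Because the entries of each label are sorted by vertex ID, the intersection label(s) ∩ label(t) can be evaluated with a single merge-style pass. The sketch below illustrates this under the assumption that a label is a list of (vertex ID, distance) pairs in ascending ID order; the function name label_join_distance is hypothetical.

```python
def label_join_distance(label_s, label_t):
    """Minimum of d(s, w) + d(w, t) over vertices w common to both labels.

    label_s, label_t: lists of (vertex_id, distance) sorted by vertex_id.
    Returns float('inf') when the labels share no vertex.
    """
    i = j = 0
    best = float("inf")
    # Advance the two cursors in lockstep, as in a sorted merge.
    while i < len(label_s) and j < len(label_t):
        vs, ds = label_s[i]
        vt, dt = label_t[j]
        if vs == vt:
            best = min(best, ds + dt)
            i += 1
            j += 1
        elif vs < vt:
            i += 1
        else:
            j += 1
    return best
```

The pass touches each entry at most once, so its cost is linear in the combined label size.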

7. EXPERIMENTAL EVALUATION 

We evaluate the performance of our method and compare with 
other related methods for processing P2P distance queries. All sys- 
tems tested were programmed in C++ and compiled with the same 
compiler. All experiments were performed on a computer with an
Intel 3.3 GHz CPU, using 4GB RAM and a 7200 RPM SATA hard
disk, running Ubuntu 11.04 Linux OS.

We use the following datasets in our experiments: Web, 
BTC, as-Skitter, wiki-Talk and web-Google. BTC is an un- 
weighted graph, which is a semantic graph converted from the Bil- 
lion Triple Challenge 2009 RDF dataset (http://vmlion25.deri.ie/), 
where each vertex represents an object such as a person, a doc- 
ument, and an event, and each edge represents the relationship 
between two nodes such as "has-author", "links-to", and "has- 
title". Web (http://barcelona.research.yahoo.net/webspam) is a 
subgraph of the UK Web graph, where vertices are pages and 
edges are hyperlinks. The original graph is directed and is
converted into an undirected graph G in this way: if two vertices
are reachable from each other within w hops in the original graph,
where w ∈ {1, 2}, they have an undirected edge with weight w in
G. Since there are many connected components in G, we extract
the largest connected component for our experiments. As-Skitter
is an Internet topology graph from traceroutes run daily in 2005
(http://www.caida.org/tools/measurement/skitter). The wiki-Talk
network contains all the users and discussions from Wikipedia till
January 2008. Nodes in the network represent users of Wikipedia
(http://www.wikipedia.org/) and an undirected edge between node
i and node j means that user i has edited at least one talk page
of user j or vice versa. In web-Google, nodes represent web
pages and hyperlinks between them are represented by undirected
edges. It was released for the Google Programming Contest in
2002 (http://www.google.com/programming-contest/). We list the
datasets in Table 2.





             |V|      |E|      Avg. Deg   Max Deg   Disk size
BTC          164.7M   361.1M   2.19       105,618   5.6 GB
Web          6.9M     113.0M   16.40      31,734    1.1 GB
as-Skitter   1.7M     22.2M    13.08      35,455    200 MB
wiki-Talk    2.4M     9.3M     3.89       100,029   100 MB
Google       0.9M     8.6M     9.87       6,332     80 MB

Table 2: Real datasets

7.1 Results of Index Construction 

We first report the results for our index construction. We list
the number of levels (k), the number of vertices (|V_{G_k}|) and edges
(|E_{G_k}|) of the graph G_k, the total label size, and indexing time in
Table 3. We set the k-selection criterion as follows: when the graph
size of G_{i+1} is larger than 95% of the graph size of G_i, i.e., when
(|V_{i+1}| + |E_{i+1}|) ≥ 0.95 * (|V_i| + |E_i|), set k = i. This is to
say that the independent set L_i has introduced less than 5% of graph
size reduction. We shall use 95% as our default threshold.





             k    |V_{G_k}|   |E_{G_k}|   Label size   Indexing time (seconds)
BTC          6    134K        16.4M       10.6 GB      2513.73
Web          19   242K        14.5M       13.1 GB      2274.36
as-Skitter   6    86K         8.5M        678.3 MB     483.65
wiki-Talk    5    14K         2.4M        152.5 MB     239.48
Google       7    87K         2.5M        199.5 MB     35.13

Table 3: Index construction results with threshold 0.95

Intuitively, with more levels in the vertex hierarchy, we get a
smaller graph G_k, a bigger label size, and a longer indexing time.
This in turn affects the query time, which we discuss further in the
next subsection.

7.2 Results of Query Performance 

To assess query performance, we randomly generate 1000
queries in each dataset and compute the average query time. The
results for our datasets are shown in Table 4. The total time for each
query is made up of two parts: the first part, Time (a), is the time
for retrieving labels for s and t if needed; the second part, Time (b),
is the time for the bi-Dijkstra search. We note that Time (a) for the
dataset Web is much greater since the label size for Web is much
bigger. Although BTC is a very large dataset, the query time is very
short; this is due to the low average degree in the graph, which makes
the bi-Dijkstra search highly efficient. Note that even though
wiki-Talk and Google are much smaller in size, Time (a) is still above
10ms, which is due to the speed of our hard disk, with a benchmark
of 10ms per disk I/O. For these datasets, the label sizes are very
small, and in fact they can be kept in main memory, in which case
we save the Time (a) component of the total time. We call this
approach in-memory IS-LABEL, or IM-ISL for short.





             k    Total query time (ms)   Time (a) (ms)   Time (b) (ms)
BTC          6    11.55                   11.47           0.08
Web          19   28.02                   20.08           7.94
as-Skitter   6    20.05                   12.68           7.37
wiki-Talk    5    12.22                   10.85           1.37
Google       7    12.97                   10.37           2.60

Table 4: Query time with threshold 0.95: Time (a) denotes the
time used for getting the label, Time (b) denotes the time used
for bi-Dijkstra search



Table 5 shows results of different query types using IS-LABEL.
There are three types of queries: Type 1: both s and t are in G_k;
Type 2: one of s, t is in G_k; Type 3: both s and t are not in G_k.
We can see that a Type 1 query has the shortest average query time,
since there is no need to look up the labels; a Type 2 query requires
the lookup of the label of only one query vertex; and for Type 3
we need to retrieve the labels of both query vertices. The time for
running the bi-Dijkstra search on G_k does not vary much for the
three types of queries.





       k    Query type   Total query time (ms)   Time (a) (ms)   Time (b) (ms)
BTC    6    1            0.08                    0.0             0.08
            2            5.85                    5.73            0.12
            3            9.03                    8.94            0.09
Web    19   1            10.40                   0.0             10.40
            2            19.61                   10.14           9.47
            3            29.81                   20.37           9.44

Table 5: Query time for 3 types of queries: time (a) denotes the
time used for getting the label, time (b) denotes the time used
for bi-Dijkstra search

When index construction is based on different k values, the
querying time is affected. We list the querying results for graphs
BTC and Web with different k values in Table 6. The greater k is,
the smaller the size of graph G_k, which leads to a shorter time for
the bidirectional Dijkstra search. However, the time for scanning
labels increases as the label size grows with a larger k. Considering
all factors, we can conclude that the k values that we have chosen
automatically, as shown in Table 3, are highly effective.





       k    |V_{G_k}|   |E_{G_k}|   Label size   Indexing time (s)   Query time (ms)
BTC    5    167K        17.2M       7.2 GB       1555.24             10.45
BTC    6    134K        16.4M       10.6 GB      2513.73             11.55
BTC    7    114K        15.8M       17.1 GB      7227.40             12.37
Web    18   260K        15.2M       12.2 GB      2115.31             30.72
Web    19   242K        14.5M       13.1 GB      2274.36             28.02
Web    20   226K        13.8M       13.9 GB      2485.24             33.65

Table 6: Index construction time, label size, G_k size and query
time with different k values





             k   |V_{G_k}|   |E_{G_k}|   Label size   Indexing time (s)   Query time (ms)
BTC          5   167K        17.2M       7.2 GB       1818.21             10.64
Web          7   808K        31.1M       1.6 GB       752.69              40.85
as-Skitter   4   160K        9.3M        221.9 MB     246.69              18.98
wiki-Talk    4   17K         2.4M        99.3 MB      182.32              11.38
Google       6   107K        2.7M        127.3 MB     25.57               12.96

Table 7: Index construction time, label size, G_k size, and query
time with threshold 0.9

To investigate how the k-selection criterion may impact the
overall performance, we examine another setting where we set k = i
when (|G_i|/|G_{i-1}|) > 90%. We list the index construction results
of using 90% as our threshold in Table 7. We can see that the
lower threshold (90% instead of 95%) gives rise to smaller k values,
which lead to larger sizes for G_k, smaller label sizes and shorter
indexing times. However, the query time in the case of dataset Web
becomes greater, which is a trade-off for the smaller indexing costs.
Depending on the available resources and application requirements,
the threshold can be tuned to a desirable value. Nevertheless, the
query time remains in the range of tens of milliseconds for both
threshold settings (Tables 4 and 7). This shows that the query
performance is robust to the choice of the threshold.

7.3 Comparison with Other Methods 

There exist a number of recent works on point-to-point distance
querying. The most recent work by Jin et al. [17] shows that their
method outperforms other state-of-the-art approaches. However,
the space requirement of their program exceeds our RAM capacity
for the larger datasets, while for our smaller datasets, the indexing
time was prohibitively long. Note that their results recorded over 70
hours of labeling time for a small dataset with only 694K vertices
and 312K edges [17]. We next tried to compare with the method
TEDI in [32]. However, TEDI ran out of memory for each of our
datasets due to a very large root node in the tree decomposition.





             IS-LABEL   IM-ISL    VC-Index(P2P)   IM-DIJ
BTC          11.55 ms   -         4246.09 ms      -
Web          28.02 ms   -         31655.77 ms     430.67 ms
as-Skitter   20.05 ms   7.15 ms   3712.33 ms      23.16 ms
wiki-Talk    12.22 ms   1.23 ms   553.94 ms       9.97 ms
Google       12.97 ms   2.44 ms   1285.25 ms      9.09 ms

Table 8: Query time of IS-LABEL, in-memory IS-LABEL (IM-ISL),
VC-Index (converted for P2P) and IM-DIJ





Dataset       Index construction time (seconds)    Index size
BTC           6221.44                              3.1 GB
Web           3544.38                              3.0 GB
as-Skitter    1013.07                              486.5 MB
wiki-Talk     52.79                                137.1 MB
Google        70.37                                211.3 MB

Table 9: Indexing costs for VC-Index



We find that no known point-to-point distance querying mechanism can
handle our data sizes, hence we compare with the best related method
that can be converted to work for point-to-point querying. The most
efficient such method is the VC-Index proposed by Cheng et al. in
[11]. Since VC-Index is designed for single-source shortest path
queries, we modified the source code to make it work specifically for
point-to-point distance queries by making the program stop once the
distance from s to t is found. We compare our method with this
converted VC-Index method by taking the average query time over 1000
randomly generated queries. For the datasets that can fit into main
memory, we also compare our method with the in-memory bidirectional
Dijkstra search (IM-DIJ). We list the average query times in Table 8.
In Table 9, we list the indexing costs of VC-Index. From the
experimental results, we first notice that in-memory bi-Dijkstra
cannot work for the dataset BTC since it exceeds the memory capacity.
For the smaller datasets, in-memory IS-LABEL (IM-ISL) is faster than
the in-memory bi-Dijkstra method (IM-DIJ), and IS-LABEL is much
faster than IM-DIJ for the larger dataset Web. Although VC-Index can
handle all the datasets, including the case where the data does not
fit in main memory, we find that IS-LABEL is many times faster than
VC-Index in query time. The speedup is especially significant for the
massive graphs: IS-LABEL is 368 times faster for BTC, and 1130 times
faster for Web. Meanwhile, the index construction time of IS-LABEL is
also less than that of VC-Index.
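
The conversion from a single-source search to a point-to-point query by early termination can be illustrated as follows. This is not the VC-Index code: it is a plain in-memory Dijkstra search, used only to show the "stop once the distance from s to t is found" idea. The function name `p2p_distance` and the adjacency-dictionary input are assumptions of the sketch.

```python
import heapq


def p2p_distance(adj, s, t):
    """Dijkstra search from s that stops as soon as t is settled.

    adj maps each vertex to a list of (neighbor, weight) pairs with
    non-negative weights. Returns float('inf') if t is unreachable.
    """
    dist = {s: 0}
    heap = [(0, s)]
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry
        settled.add(u)
        if u == t:
            return d  # early exit: the distance to t is final here
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")
```

A single-source variant would continue until the heap is empty. The early return is the only change needed to answer one (s, t) pair.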

8. PATH QUERIES, DIRECTED GRAPHS, 
AND UPDATE MAINTENANCE 



In this section, we discuss the extension of our method to answer 
shortest-path queries and to handle directed graphs. We also briefly 
discuss how update maintenance can be processed when the input 
graph is updated dynamically. 

8.1 Shortest-Path Queries 

To answer a P2P shortest-path query, we need to keep some extra
information in the vertex labels. When an augmenting edge (u, w) is
created in $G_i$ with
$\omega_{G_i}(u, w) = \omega_{G_{i-1}}(u, v) + \omega_{G_{i-1}}(v, w)$,
we also keep the intermediate vertex v along with the augmenting edge
to indicate that the edge represents the path (u, v, w). Note that
(u, v) and (v, w) are edges in $G_{i-1}$, which in turn can be
augmenting edges. In the labeling process, in addition to adding the
entry (w, d(u, w)) to label(u), we also attach the intermediate
vertex v (if any) for (u, w). Thus, the entry becomes a triple
(w, d(u, w), v) (or (w, d(u, w), $\phi$), if there is no intermediate
vertex). Note that we keep the graph $G_k$, and thus the intermediate
vertex of any augmenting edge in $G_k$ is directly attached to the
edge.

Given a query, s and t, if the query is of Type 1, the answer is
determined by two label entries, (w, d(s, w), v) and (w, d(t, w), v').
If $v \neq \phi$ (similarly for v'), we form two new queries (s, v)
and (v, w). In this way, we recursively form queries until the
intermediate vertex in a label entry is $\phi$. It is then
straightforward to obtain the resulting path by linking all the
intermediate vertices. If the query is of Type 2, then the answer is
determined by two label entries and a path in $G_k$. The subpaths
from the two label entries are derived in the same way as we do for a
Type 1 query. The path in $G_k$ is expanded into the original path in
G by forming new queries, "u and v" and "v and w", for any augmenting
edge (u, w) with the intermediate vertex v. For each such query, the
corresponding subpath is obtained as discussed above. The I/O
complexity of the overall process is given by $O(|SP_G(s, t)|)$,
where $|SP_G(s, t)|$ is the number of edges on $SP_G(s, t)$.
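
The recursive expansion through intermediate vertices can be sketched as below. This is a simplified in-memory illustration, not the disk-based procedure of the paper: it assumes that the intermediate vertex of every pair encountered during the recursion is available in a single dictionary `via` (a hypothetical structure), with `None` playing the role of $\phi$. The names `expand` and `type1_path` are likewise assumptions.

```python
def expand(u, w, via):
    """Expand the edge or label entry (u, w) into a vertex path of G.

    via maps frozenset({u, w}) to the intermediate vertex recorded for
    that pair, or to None when (u, w) is an original edge (phi).
    """
    if u == w:
        return [u]
    v = via[frozenset((u, w))]
    if v is None:
        return [u, w]
    left = expand(u, v, via)
    right = expand(v, w, via)
    return left + right[1:]  # drop the duplicated occurrence of v


def type1_path(s, t, w, via):
    """Path s ... w ... t for a query answered through common vertex w."""
    return expand(s, w, via) + expand(w, t, via)[1:]
```

Each recursive call either emits one original edge or splits a pair into two shorter pairs, so the number of calls is proportional to the number of edges on the resulting path.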

8.2 Handling Directed Graphs 

To handle directed graphs, we need to modify the vertex hierarchy
construction as well as the vertex labeling. Let us use (u, v) to
indicate an edge from u to v in this subsection. The concept of
independent set can be applied in the same way by simply ignoring the
direction of the edges. However, for distance preservation, we create
an augmenting edge (u, w) at $G_i$ only if $\exists v \in L_{i-1}$
such that $(u, v), (v, w) \in E_{G_{i-1}}$. We distinguish two types
of ancestors for a vertex v: in-ancestors and out-ancestors. The
definition of in-ancestors is similar to that of ancestors in
undirected graphs, except that we only consider edges from
higher-level vertices to lower-level vertices. Analogously, the
definition of out-ancestors concerns edges going from lower-level
vertices to higher-level vertices.
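
The directed augmenting-edge rule can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: edges are held in a dictionary keyed by directed pairs, the function name `next_level` is hypothetical, and keeping the smaller weight when an edge (u, w) already exists is assumed to carry over from the undirected construction.

```python
def next_level(edges, independent):
    """Build the edge weights of G_i from those of G_{i-1}.

    edges maps a directed pair (u, v) to its weight in G_{i-1};
    independent is L_{i-1}, the set of vertices removed at this level.
    """
    # Keep every edge whose endpoints both survive.
    nxt = {(u, v): w for (u, v), w in edges.items()
           if u not in independent and v not in independent}
    preds, succs = {}, {}
    for (u, v), w in edges.items():
        if v in independent:
            preds.setdefault(v, []).append((u, w))
        if u in independent:
            succs.setdefault(u, []).append((v, w))
    # Add (u, x) only when a directed two-hop path u -> v -> x exists.
    for v in independent:
        for u, w_in in preds.get(v, []):
            for x, w_out in succs.get(v, []):
                if u == x:
                    continue  # a round trip u -> v -> u adds no edge
                cand = w_in + w_out
                if cand < nxt.get((u, x), float("inf")):
                    nxt[(u, x)] = cand
    return nxt
```

Unlike the undirected case, a removed vertex with only incoming or only outgoing edges produces no augmenting edges at all.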

The labeling needs to handle two directions. For each vertex v, we
need two types of labels defined as follows. The in-label of a vertex
$v \in V_G$, denoted by $LABEL_{in}(v)$, is defined as
$LABEL_{in}(v) = \{(u, dist_G(u, v)) : u \in V_G \text{ is an in-ancestor of } v\}$.
The out-label of a vertex $v \in V_G$, denoted by $LABEL_{out}(v)$,
is defined as
$LABEL_{out}(v) = \{(u, dist_G(v, u)) : u \in V_G \text{ is an out-ancestor of } v\}$.

Given a P2P distance query with two input vertices, s and t, we
compute $X = LABEL_{out}(s) \cap LABEL_{in}(t)$ and then answer the
query in the same way as given in the corresponding equation for the
undirected case.
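
A sketch of the label-only part of a directed query is given below. The referenced equation is not reproduced in this section, so the sketch assumes it has the usual form of a minimum of $d(s, w) + d(w, t)$ over the common vertices w in X. It also covers only the case where the answer is determined by the labels alone and omits any search in $G_k$. The function name and dictionary layout are assumptions.

```python
def label_distance(label_out_s, label_in_t):
    """Smallest d(s, w) + d(w, t) over vertices w common to both labels.

    label_out_s maps each out-ancestor w of s to dist(s, w);
    label_in_t maps each in-ancestor w of t to dist(w, t).
    Returns float('inf') when the two labels share no vertex.
    """
    common = label_out_s.keys() & label_in_t.keys()
    return min((label_out_s[w] + label_in_t[w] for w in common),
               default=float("inf"))
```

Using the out-label of s and the in-label of t ensures that every candidate combines a path leaving s with a path entering t, which respects edge directions.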

8.3 Update Maintenance 

When the input graph is updated, we want to update the vertex labels
incrementally rather than to re-compute them from scratch. We
consider the cases where vertices, along with their adjacency lists,
are inserted or deleted in the graph. For insertion of a new vertex
u, we add u to $G_k$. Next we consider each vertex v in the adjacency
list $adj_G(u)$ of u. If v is in $G_k$, then we simply add the edge
(u, v) to $E_{G_k}$ with weight $\omega_G(u, v)$. Otherwise, let
$v \in L_i$. We add $(u, \omega_G(u, v))$ to label(v). We also need
to add u to the labels of the descendants of v (a vertex w is a
descendant of v if v is an ancestor of w). The descendants of v can
be viewed as vertices in a tree rooted at v. We traverse this tree so
that the entry (u, d(u, w)) is added to or modified in label(w),
where w is a descendant of v, so that the value of d(u, w) is set to
or decreased to the accumulated distance
$\omega_G(u, v) + d(v, v_1) + \dots + d(v_i, w)$, where
$v, v_1, \dots, v_i, w$ is a path in the tree. The I/O complexity is
given by the number of descendants of u. Next we consider the
deletion of a vertex u. If u is in $G_k$ and no label of other
vertices contains u, then u can simply be deleted from the adjacency
lists of all its neighbors in $G_k$. Otherwise, we look for the
descendants of u and remove the entry of u in the label of each
descendant. In this case, the I/O complexity is determined by the
number of descendants of u. The above lazy update mechanism would
have little impact on the query performance for a moderate amount of
updates, and we can rebuild the index periodically.
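
The insertion procedure can be sketched as follows. This is an in-memory illustration only: dictionaries stand in for the disk-resident labels and for $G_k$, and the `children` map encoding the descendant tree of each vertex is a hypothetical structure introduced for the example.

```python
from collections import deque


def insert_vertex(u, nbrs, in_gk, gk_adj, label, children):
    """Lazy insertion of vertex u with adjacency list nbrs = {v: weight}.

    in_gk: set of vertices of G_k; gk_adj: adjacency dict of G_k;
    label[x]: dict ancestor -> distance; children[x]: list of
    (child, d(x, child)) pairs forming the descendant tree of x.
    """
    in_gk.add(u)
    gk_adj.setdefault(u, {})
    for v, w_uv in nbrs.items():
        if v in in_gk:
            # Both endpoints are in G_k: add the edge in both directions.
            gk_adj[u][v] = w_uv
            gk_adj.setdefault(v, {})[u] = w_uv
            continue
        # v lies in some level L_i: u becomes an ancestor of v and of
        # every descendant of v, at the accumulated tree distance.
        queue = deque([(v, w_uv)])
        while queue:
            x, d = queue.popleft()
            entry = label.setdefault(x, {})
            if d >= entry.get(u, float("inf")):
                continue  # the existing entry is at least as good
            entry[u] = d
            for child, d_xc in children.get(x, []):
                queue.append((child, d + d_xc))
```

Deletion would follow the same traversal and remove the entry for u from each descendant's label instead of adding it.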

9. CONCLUSION 

In this paper, we introduce an effective disk-based indexing method
named IS-LABEL for distance and shortest path querying in massive
graphs. The directed graph version of our method simultaneously
solves the fundamental problem of reachability. Given the low costs
of IS-LABEL in index construction and querying for both massive
undirected and massive directed graphs, we expect our method to
handle large graphs for reachability queries.